feat(embedding): introduce instruction capability flag in model details #980

klamas1 wants to merge 4 commits into Kiln-AI:main from
Conversation
Summary of Changes

Hello @klamas1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the RAG system by introducing the capability to provide custom instructions to embedding models. This feature allows for more precise control over how documents and queries are vectorized, leading to improved semantic understanding and embedding quality, particularly for advanced models that support such inputs.
Walkthrough

Adds embedding-instructions support: new boolean
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant UI as Client (UI)
    participant Server as Studio Server
    participant Adapter as LitellmAdapter
    participant API as Embedding Provider API
    UI->>Server: GET available embedding models
    Server-->>UI: models (include supports_instructions)
    UI->>Server: POST create RAG config (properties + instructions if supported)
    Server->>Adapter: build options (may include instructions)
    Adapter->>Adapter: _apply_instructions_to_texts(instructions, texts)
    Adapter->>API: Send processed texts to Embedding Provider API
    API-->>Adapter: Return embeddings
    Adapter-->>Server: Return embeddings/result
    Server-->>UI: Confirm creation / return result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request introduces a new supports_instructions capability flag for embedding models, enabling users to provide custom instructions for models that support this feature, such as Qwen3-embedding. The changes involve updating API schemas, implementing instruction application logic in the LitellmEmbeddingAdapter, adding corresponding frontend UI elements, and updating model definitions and tests. The implementation appears robust, with appropriate validation and conditional rendering in the UI.
Actionable comments posted: 1
🤖 Fix all issues with AI agents

In `app/web_ui/src/routes/(app)/docs/rag_configs/[project_id]/create_rag_config/create_embedding_form.svelte`:

- Around lines 217-228: The help text for the conditional FormElement is inconsistent: update the `info_description` on the FormElement (rendered when `selectedModel && selectedModel.supports_instructions`) to use English (or wire it to the app's i18n system) so it matches the rest of the UI; locate the FormElement with `id="custom_instructions"` and replace the Russian string in `info_description` with the English equivalent (or a localized lookup) to maintain a consistent locale across the selectedModel-related instruction UI.
🧹 Nitpick comments (2)
libs/core/kiln_ai/datamodel/embedding.py (1)
14-16: Add schema metadata (and optional size limits) for `instructions`.

Right now `instructions` is an unconstrained `str`. Consider using Pydantic v2 `Annotated` + `Field` to document it and optionally cap its size to avoid unbounded payloads.

♻️ Suggested update:

```diff
-from typing import TYPE_CHECKING, List, Union
+from typing import TYPE_CHECKING, List, Union, Annotated

 class EmbeddingProperties(TypedDict, total=False):
     dimensions: PositiveInt
-    instructions: str
+    instructions: Annotated[
+        str,
+        Field(
+            description="Optional instructions to prepend/format embedding inputs.",
+            max_length=2000,
+        ),
+    ]
```

libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)
29-37: Consider reusing validated instructions when formatting inputs.

`_apply_instructions_to_texts` reads raw properties while `build_options` validates separately; passing the validated value avoids drift and keeps the logic in one place.

♻️ Suggested refactor:

```diff
-    def _apply_instructions_to_texts(self, input_texts: List[str]) -> List[str]:
-        """Apply instructions to input texts in the format expected by the model."""
-        instructions = self.embedding_config.properties.get("instructions", None)
-        if not instructions:
-            return input_texts
+    def _apply_instructions_to_texts(
+        self, input_texts: List[str], instructions: str | None
+    ) -> List[str]:
+        """Apply instructions to input texts in the format expected by the model."""
+        if not instructions:
+            return input_texts
@@
-        # Apply instructions to input texts if present
-        processed_texts = self._apply_instructions_to_texts(input_texts)
+        # Validate once and reuse the same instructions everywhere
+        options = self.build_options()
+        # Apply instructions to input texts if present (validated)
+        processed_texts = self._apply_instructions_to_texts(
+            input_texts, options.instructions
+        )
@@
-        # Get options excluding instructions since they're applied to text
-        options = self.build_options()
+        # Get options excluding instructions since they're applied to text
```

Also applies to: 139-181, 200-210
@gemini-code-assist Do you have any recommendations on the PR code?
Code Review
This pull request introduces the supports_instructions flag to EmbeddingModelDetails and adds support for custom embedding instructions in Kiln's RAG system. It includes API schema updates, backend implementation in LitellmEmbeddingAdapter, frontend updates, testing, and documentation. The changes enable users to provide custom instructions for advanced embedding models like Qwen3-embedding, allowing more precise control over document and query vectorization. The review focuses on correctness and maintainability, identifying opportunities for improved error handling and code clarity.
@klamas1 - Hey, thanks for the PR, will give it a try as soon as I can, but it seems to be looking good! Do you have an example of a use case where you used custom instructions in an embedding model? Never used it so quite curious. If you feel like it (totally optional), you could add a "paid test", which is our kind of test that makes real calls to the provider. There is an example here. On paid tests, we have these decorators.
@leonardmq, thank you for your interest in the PR. In fact, quite a few models support the Instruct prefix, at least all those based on qwen3-embedding 0.6b, 4b and 8b. Here is a quote from their documentation: https://github.com/QwenLM/Qwen3-Embedding. Just clarifying:
@klamas1 - Sounds great! Code and UI look good. Couple of things we need to do before we can merge (seems Maintainer Edits are not allowed on your fork so I cannot push there directly myself). Can you do this:
Then it should be all good to merge.
hmm,
@leonardmq - Done
```python
# Format according to Qwen3-Embedding documentation
# "Instruct: [instructions]\nQuery: [text]"
return [f"Instruct: {instructions}\nQuery: {text}" for text in input_texts]
```
Their docs suggest that the instructions wrapper should only be used for the actual query (rather than for the docs themselves).
Their README shows this:
```python
# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-0.6B', padding_side='left')
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-0.6B')
```

That suggests we should only apply the instructions during retrieval (when embedding the query) and not during embedding of the document chunks that we index.
Much of the current logic from the PR can be preserved, and we can add a flag to the embedding methods here to specify whether or not to apply the instructions (if any), have it default to False, and pass in True from here during retrieval:
Kiln/libs/core/kiln_ai/tools/rag_tools.py
Line 212 in 706ab71
```python
query_embedding_result = await embedding_adapter.generate_embeddings(
    [query],
    apply_embedding_instructions=True,
)
```

@klamas1 - thoughts?
Looping in @tawnymanticore
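A minimal sketch of what the proposed flag could look like (a hypothetical standalone function, not the actual Kiln adapter API): the Instruct wrapper is applied only when the caller opts in at query time, so indexed document chunks stay untouched.

```python
from typing import List, Optional


def apply_instructions(
    texts: List[str],
    instructions: Optional[str],
    apply_embedding_instructions: bool = False,
) -> List[str]:
    """Prefix each text with the Qwen3-style instruction wrapper, but only
    when the caller opts in (i.e. at query time, not at indexing time)."""
    if not apply_embedding_instructions or not instructions:
        return texts
    return [f"Instruct: {instructions}\nQuery: {text}" for text in texts]


task = "Given a web search query, retrieve relevant passages that answer the query"

# Indexing path: document chunks are embedded as-is (flag defaults to False)
docs = apply_instructions(["The capital of China is Beijing."], task)

# Retrieval path: only the query gets the instruction wrapper
query = apply_instructions(
    ["What is the capital of China?"], task, apply_embedding_instructions=True
)
```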
Thanks for this contribution @klamas1! great idea.
re: should we be conditioning queries or documents on instructions...
agree with @leonardmq. from that readme they are conditioning queries on the instruction so that it will have a better hit rate against documents. their example shows this
```python
conditioned_query = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is the capital of China?"
document = "The capital of China is Beijing"
conditioned_query.dot(document)  # --> high (pseudocode: dot product of their embeddings)
```
so for the indexing portion of RAG, we should be embedding the documents as normal. then building a custom query function that gets some instructions as conditionals.
now that's what Qwen recommends anyway. does the inverse work to save on runtime compute? maaaaybeeee? Would depend on whether Qwen specifically fine-tuned with an Instruct/Query setup. if they did, then it must be conditioned_query.dot(document) at runtime. if this is zero shot then the inverse may be possible with the following framing:

("What is the capital of China?").dot("Instruct: Given a retrieval passage, what was the original web search query? \n Passage: The capital of China is Beijing"). this is probably all pretty testable in a python notebook or something
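For a quick notebook-style sanity check of that framing, one can compare cosine similarities with toy vectors (the numbers below are illustrative only, not real model embeddings; the property being tested is that conditioning moves the query vector closer to the document vector):

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-d "embeddings": the conditioned query is constructed to point
# closer to the document than the raw query does.
document = np.array([1.0, 0.2, 0.0])
raw_query = np.array([0.4, 1.0, 0.3])
conditioned_query = np.array([0.9, 0.4, 0.1])

raw_sim = cosine(raw_query, document)
conditioned_sim = cosine(conditioned_query, document)
```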
In fact, in some cases, instructions are also needed when embedding documents, but, indeed, this is a corner case, such as clustering.
I will add a flag to the search function.
Later, I also want to try HyDE (https://aclanthology.org/2023.acl-long.99/).
I'm diving into this so deeply because my task is very specific; my documents are deeply nested, branched YAML configs with consistency across branches.
Maybe I'm going in the wrong direction, but if you have any ideas, I'd be happy to explore them.
🧹 Nitpick comments (1)
app/desktop/studio_server/provider_api.py (1)
184: `EmbeddingModelDetails.supports_instructions` should have a default value for safety and consistency.

`supports_instructions: bool` is declared without a default in `EmbeddingModelDetails`, making it required. While both visible construction sites in this file correctly supply the value from `provider.supports_instructions` / `ollama_provider.supports_instructions`, adding a default value would align with `KilnEmbeddingModelProvider` (which has `supports_instructions: bool = Field(default=False, ...)`) and protect against future code that may construct this model without explicitly passing the field.

Recommended: add a default to `EmbeddingModelDetails.supports_instructions`:

```diff
 class EmbeddingModelDetails(BaseModel):
     id: str
     name: str
     n_dimensions: int
     max_input_tokens: int | None
     supports_custom_dimensions: bool
     suggested_for_chunk_embedding: bool
-    supports_instructions: bool
+    supports_instructions: bool = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/desktop/studio_server/provider_api.py` at line 184, EmbeddingModelDetails currently declares supports_instructions: bool without a default, making it required; add a safe default (e.g., False) to the supports_instructions field on EmbeddingModelDetails so it matches KilnEmbeddingModelProvider and prevents construction errors if callers omit the field. Update the declaration in EmbeddingModelDetails (the supports_instructions attribute) to use a default value (either supports_instructions: bool = False or supports_instructions: bool = Field(default=False, ...), keeping any existing description metadata consistent with KilnEmbeddingModelProvider) so existing callers continue to work and future callers are protected.
OMG, I can't commit. I've been defeated by pre-commit hooks: I keep getting errors for mistakes I didn't make.
@klamas1 - Can you bypass the pre-commit hook and commit anyway? I will have a look and commit to your branch (if it lets me) - might be a minor issue with some types that we renamed a couple of days ago.
Force-pushed 51f6b1f to e3f4829
@leonardmq I did a commit with --no-verify. However, I didn't like the solution of passing apply_embedding_instructions through multiple functions, even when it wasn't necessary. I got a little confused with the commits, so it's not in this PR.
Add a new `supports_instructions` boolean attribute to the `EmbeddingModelDetails` class, allowing providers to specify whether their embedding models can handle instructional inputs. This update is reflected in both the general provider API connection and the Ollama-specific embedding model retrieval functions.
- Add schema constraints for instructions field
- Refactor to eliminate validation duplication
- Update test cases for new method signature
Force-pushed e3f4829 to c68a576
Add support for custom embedding instructions
Add a new `supports_instructions` boolean attribute to the `EmbeddingModelDetails` class, allowing providers to specify whether their embedding models can handle instructional inputs. This enables users to provide custom instructions for advanced embedding models like Qwen3-embedding.

What does this PR do?
This PR implements support for custom embedding instructions in Kiln's RAG system. The feature allows users to provide custom instructions to embedding models that support them (like Qwen3-embedding), enabling more precise control over how documents and queries are vectorized.
Key Changes:

API Schema Updates:
- Added `instructions` field to `EmbeddingProperties` in the datamodel
- Extended the `EmbeddingOptions` class to include instructions support
- Added a `supports_instructions` flag for each model

Backend Implementation:
- Added the `_apply_instructions_to_texts()` method in `LitellmEmbeddingAdapter`
- Instructions are applied in the format `"Instruct: {instructions}\nQuery: {text}"`

Frontend Updates:

Testing:
- Updated tests for the `supports_instructions` field

Documentation:
Usage:
Users can now create embedding configurations with custom instructions:

```json
{
  "name": "Qwen3 with instructions",
  "model_provider_name": "ollama",
  "model_name": "qwen3-embedding:8b",
  "properties": {
    "instructions": "Use semantic similarity for document retrieval. Focus on conceptual meaning rather than exact keyword matches."
  }
}
```

The instructions will be applied to all texts processed by this embedding model, ensuring consistent semantic understanding across the RAG pipeline.
Related Issues
No specific issues, this is an enhancement to improve embedding quality for advanced models.
Contributor License Agreement
I, @klamas1, confirm that I have read and agree to the Contributors License Agreement.